<!DOCTYPE html>
<html lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>cpplib TODO</title>
<link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
</head>

<body>
<h1>Projects relating to cpplib</h1>

<p><em>Note: this writeup represents state as of 2002.</em></p>

<p>cpplib has largely been completed, and is stable at this point.
For GCC versions 3.0 and later, it is linked into the C, C++ and
Objective C front ends.  Most future work will relate to character set
issues, performance enhancements and improving cpplib as a stand-alone
library.</p>

<h2>Work recently completed</h2>

<ol>
  <li>Stand-alone CPP is dead.  The compiler front end now handles
      preprocessed output if necessary.</li>

  <li>As many built-in macros as possible have been moved to the front
      ends, and out of SPECS and cpplib itself (some targets still
      in progress).</li>

  <li>CPP arithmetic is now done to the correct target precision, based
      upon the selected language standard.</li>

  <li>The traditional preprocessor has been integrated into cpplib.
      At present it is an output-only preprocessor, but it should be
      fairly simple to modify cpplib so that traditional preprocessing
      and then tokenization are performed in one invocation.</li>
</ol>

<h2>Greater Coordination with the Front Ends</h2>

<p>The integrated preprocessor would benefit from greater integration
with the front ends.  It still feels like it has been tacked on as an
after thought, which is not entirely coincidental.</p>

<ol>
  <li>Character sets that are strict supersets of ASCII are safe to
      use, but extended characters cannot appear in identifiers.  This
      has to be coordinated with the C and C++ front ends.  See <a
      href="#charset">character set issues</a>, below.</li>

  <li>C99 universal character escapes (<code>\uxxxx</code>,
      <code>\Uxxxxxxxx</code>) are not recognized in identifiers.
      Proper support has to be coordinated with the front ends.</li>

  <li>Precompiled headers are commonly requested; this entails the
      ability for cpp to dump out and reload all its internal state.
      You can get some of this with the debug switches, but not all,
      and not in a reloadable format.  The front end must cooperate
      also.</li>

  <li>Integration of diagnostic reporting.  The front ends could use
      extra information only available to the preprocessor, such as
      column numbers and macros under expansion.  The existing code
      copies cpplib's internal state into the state used by
      <code>diagnostic.c</code>, which is better than writing out and
      processing linemarker commands, but still suboptimal.</li>

  <li>If YACC did not insist on assigning its own values for token
      codes, there would be no need for a translation layer between
      the codes returned by cpplib and the codes used by the parser.
      Noises have been made about a recursive-descent parser that
      could handle all of C, C++, Objective C; if this ever happens,
      it should use cpplib's token codes.</li>

  <li>String concatenation should be handled in the function
      <code>c_lex</code> in <code>c-lex.c</code>.  Then the front ends
      would not have to jump through hoops to remember to concatenate
      strings, and we could simplify the parsers a little too.</li>
</ol>

<h2>Potential minor improvements</h2>

<ol>
  <li>The file-handling code allocates lots of items with xmalloc.
      The rest of cpplib is now reasonably efficient in its use of
      memory; minor improvements are certainly still possible.</li>

  <li>There might be room for further improvement of macro expansion
      performance, although it is now pretty good.  For example, we
      currently pre-expand each argument (if necessary) into its own
      buffer, replace the arguments in the replacement list with their
      expansions, and then free up each buffer.  It might be better to
      simply expand the arguments into the final argument-replaced
      expansion, saving one copy per argument and the need to free the
      argument expansion buffers.  It has the disadvantage that we
      don't know the size we need to make the token buffer in advance
      [equally, though, we don't know the size we need to make each
      expanded argument buffer, either].  In view of this, a further
      enhancement might then be to permit the list of token pointers
      that represents the expansion to be made up of more than one
      run.  Then we would just need to append a new run, rather than
      reallocating the expansion buffer if we overflow its initial
      bounds.</li>

  <li>It might be worth trying to optimize wrapper headers - files
      containing only an #include of another file, so that they are
      optimized out on reinclusion.  This is more tricky than it may
      sound - something with heuristics similar to the
      multiple-include optimization is needed, that handles multiple
      levels of wrapper headers.</li>
</ol>

<h2 id="charset">Character set issues</h2>

<p>Proper non-ASCII character handling is a hard problem.  Users want
to be able to write comments and strings in their native language.
They want the strings to come out in their native language and not
gibberish after translation to object code.  Some users also want to
use their own alphabet for identifiers in their code.  There is no
one-to-one or many-to-one map between languages and character set
encodings.  The subset of ASCII that is included in most modern day
character sets does not include all the punctuation C uses; some of
the missing punctuation may be present but at a different place than
where it is in ASCII.  The subset described in ISO646 may not be the
smallest subset out there.</p>

<p>At the present time, GCC supports the use of any encoding for
source code, as long as it is a strict superset of 7-bit ASCII.  By
this I mean that all printable (including whitespace) ASCII
characters, when they appear as single bytes in a file, stand only for
themselves, no matter what the context is.  This is true of ISO8859.x,
KOI8-R, and UTF8.  It is not true of Shift JIS and some other popular
Asian character sets.  If they are used, GCC may silently mangle the
input file.  The only known specific example is that a Shift JIS
multibyte character ending with 0x5C will be mistaken for a line
continuation if it occurs at the end of a line.  0x5C is "\" in ASCII.</p>

<p>Assuming a safe encoding, characters not in the base set listed in
the standard (C99 5.2.1) are syntax errors if they appear outside
strings, character constants, or comments.  In strings and character
constants, they are taken literally - converted blindly to numeric
codes, or copied to the assembly output verbatim, depending on the
context.  If you use the C99 <code>\u</code> and <code>\U</code>
escapes, you get UTF8, no exceptions.  These too are only supported in
string and character constants.</p>

<p>We intend to improve this as follows:</p>

<ol>
  <li>cpplib will be reworked so that it can handle any character set
      in wide use, whether or not it is a strict superset of 7-bit
      ASCII.  This means that cpplib will never confuse non-ASCII
      characters with C punctuators, comment delimiters, or whatever.
      </li>

  <li>In comments, naturally any character will be permitted to appear.
      </li>

  <li>All Unicode code points which are permitted by C99 Annex D to
      appear in identifiers, will be accepted in identifiers.  All
      source-file characters which, when translated to Unicode,
      correspond to permitted code points, will also be accepted.  In
      assembly output, identifiers will be encoded in UTF8, and then
      reencoded using some mangling scheme if the assembler cannot
      handle UTF8 identifiers.  (Does the new C++ ABI have anything to
      say about this?  What does the Java compiler do?)
      <br />
      Unicode <code>U+0024</code> will be permitted in
      identifiers if and only if <code>$</code> is permitted.</li>

  <li>In strings and character constants, GCC will translate from the
      character set of the file (selectable on a per-file basis), to
      the current execution character set (chosen once per
      compilation).  This may or may not be Unicode.  UCN escapes will
      also be converted from Unicode to the execution character set;
      this happens independent of the source character set.</li>

  <li>Each file referenced by the compiler may state its own character
      set with a <code>#pragma</code>, or rely on the default
      established by the user with locale or a command-line option.
      The <code>#pragma</code>, if used, must be the first line in
      the file.  This will not prevent the multiple include
      optimization from working.  GCC will also recognize MULE
      (Multilingual Emacs) magic comments, byte order marks, and any
      other reasonable in-band method of specifying a file's character set.
      </li>
</ol>

<p>It's worth noting that the standard C library facilities for
"multibyte character sets" are not adequate to implement the above.
The basic problem is that neither C89 nor C99 gives you any way to
specify the character set of a file directly.  You can manipulate the
"locale," which indirectly specifies the character set, but that's a
global change.  Further, locale names are not defined by the C
standard nor is there any consistent map between them and character
sets.</p>

<p>The Single Unix specification, and possibly also POSIX, provide the
<code>nl_langinfo</code> and <code>iconv</code> interfaces which
mostly circumvent these limitations.  We may require these interfaces
to be present for complete non-ASCII support to be functional.</p>

<p>One final note: EBCDIC is, and will be, supported as a source
character set if and only if GCC is compiled for a host (not a target)
which uses EBCDIC natively.</p>
</body>
</html>
